Lab One - Exploring Table Data¶
- Group:
- Salissa Hernandez
- Juan Carlos Dominguez
- Leonardo Piedrahita
- Brice Danvide
1. Business Understanding¶
The Hotel Bookings Dataset contains a comprehensive collection of information pertaining to hotel bookings, cancellations, and guests' details for Resort and City Hotels. From the dataset's description, "it provides insights into various aspects such as the type of hotel, the number of adults, children, and babies per booking, the length of stay, the meal plan chosen by guests," and other features. The travel industry is booming following the COVID restrictions in place over the past few years. However, cancellations are inevitable and can negatively impact hotel operations. This dataset facilitates analysis of the factors likely to lead to a cancellation, such as long lead times, the average daily rate, or past cancellation behavior; our aim is to identify these trends in order to predict whether a cancellation will occur.
The dataset contains 119,390 rows and 32 features. The target feature is
is_canceled, which indicates whether a reservation was canceled. Although the methods for acquiring this dataset are not explicitly stated, it is reasonable to infer that it was collected by hotel management systems or third-party booking platforms, likely to assess the factors and trends leading to reservation cancellations. Cancellations can financially impact hotel operations, especially when they occur last minute, after hotels have already allocated their budgets. The prediction task is to predict whether a cancellation will happen based on trends and patterns identified in the data. For example, if city hotels experience more cancellations than resort hotels, it could indicate that guests value resort stays more highly. Accurate predictions would benefit several stakeholders:
- Hotel Managers: They could use the model to minimize revenue loss and optimize operations by adjusting pricing or offering last-minute deals for canceled rooms.
- Marketing Teams: Insights from cancellation trends could help marketing departments design campaigns that attract guests less likely to cancel, or improve service offerings to reduce cancellation rates.
- Local Businesses: Hotels often operate near popular destinations (e.g., Disneyland or downtown areas), so reducing cancellations ensures more tourists stay, benefiting nearby businesses.
Measure of Success¶
To determine whether the prediction algorithm is successful, several criteria should be considered:
Accuracy: The model should perform significantly better than a naive baseline. Because most bookings in this dataset are not canceled, always predicting the majority class already beats the 50% of random chance, so the bar to clear is higher than a coin flip. A success rate of over 90% would indicate that the model is effective in predicting cancellations, ensuring more reliable business decisions and helping hotels confidently adjust their overbooking strategies, dynamically price rooms, and increase operational efficiency.
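Since roughly 37% of the bookings in this dataset are canceled (the mean of is_canceled in df.describe()), the majority-class baseline can be computed directly. The 0.37 below is that observed rate, used here only as an approximation:

```python
# Approximate cancellation rate observed in the dataset (see df.describe()).
cancel_rate = 0.37

# Accuracy of the trivial strategy "always predict not canceled".
baseline_accuracy = 1 - cancel_rate

print(f"majority-class baseline: {baseline_accuracy:.0%}")  # 63%
```

Any model whose accuracy does not clearly exceed this baseline adds no predictive value.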
Precision and Recall:
- Precision: It is crucial to avoid incorrect predictions of cancellations (false positives) as these could lead to overbooking and customer dissatisfaction.
- Recall: Ensuring that most cancellations are correctly predicted (high recall) minimizes unexpected losses due to no-shows.
- A balanced F1 score (harmonic mean of precision and recall) would indicate an optimal trade-off between avoiding false positives and false negatives, making the model more practical for real-world use.
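As a quick sanity check on these definitions, here is a toy sketch in pure Python; the labels are invented for illustration, and in practice sklearn.metrics provides precision_score, recall_score, and f1_score:

```python
# 1 = canceled, 0 = not canceled; both lists are made up for this sketch.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]   # actual outcomes
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]   # hypothetical model predictions

# Count true positives, false positives, and false negatives.
tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

precision = tp / (tp + fp)  # of predicted cancellations, how many were real
recall = tp / (tp + fn)     # of real cancellations, how many were caught
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```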
Practical Impact: The model needs to be robust and flexible to handle real-world variability. For example, it should accommodate fluctuating booking patterns during peak seasons and varying customer behavior. Additionally, a successful model should help hotels maintain occupancy rates close to industry benchmarks (typically around 70%-80%) by dynamically adjusting prices for canceled rooms or offering last-minute deals to minimize revenue loss.
Real-World Examples and Benchmarks¶
Overbooking Strategies: Many hotels rely on predictive models to allow for overbooking, ensuring occupancy rates near 100%. Accurately predicting cancellations enables hotels to avoid losing revenue from empty rooms while preventing too many customers from arriving with insufficient room availability.
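As a rough illustration of the mechanics (all numbers below are hypothetical, not derived from this dataset), a predicted cancellation rate translates into an overbooking allowance like so:

```python
# Hypothetical numbers for illustration; a real policy would use the
# model's per-booking probabilities and a service-level constraint.
rooms = 200        # physical capacity for one night
bookings = 230     # reservations accepted (overbooked by 30)
p_cancel = 0.15    # predicted cancellation rate for this night

# Expected guests who actually show up, and resulting occupancy.
expected_arrivals = bookings * (1 - p_cancel)
expected_occupancy = expected_arrivals / rooms

print(f"expected arrivals: {expected_arrivals:.1f}")
print(f"expected occupancy: {expected_occupancy:.2%}")
```

Here overbooking by 30 rooms keeps expected occupancy just under 100%; underestimating the cancellation rate would instead risk walking guests.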
Revenue Management: Accurate cancellation predictions can lead to a 10-15% improvement in overall hotel revenue by optimizing pricing strategies and maximizing room availability.
Operational Efficiency: By predicting cancellations and adjusting for them in advance, hotels can better allocate their resources, adjust staffing levels, and reduce operational costs. For instance, if a high probability of cancellations is predicted for a particular weekend, hotel management can optimize staff scheduling and avoid overstaffing.
Dataset Resources:
- https://www.kaggle.com/datasets/thedevastator/hotel-bookings-analysis/data
- https://data.world/mesum/hotel-bookings-dataset/workspace/file?filename=hotel_bookings.csv
Kaggle Dataset: Hotel Bookings Analysis contains data collected from a city hotel and a resort hotel, gathered to analyze booking patterns and cancellations in order to optimize operations and improve customer satisfaction.
data.world Dataset: Hotel Bookings is similar to the Kaggle dataset, including information such as bookings and customer demographics. It was gathered to analyze booking trends and predict cancellations to optimize operations.
2. Data Understanding¶
# Modules & Libraries
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.dates import DateFormatter, MonthLocator
import matplotlib.dates as mdates
import copy
import umap.umap_ as umap
from sklearn.preprocessing import StandardScaler
from matplotlib.animation import FuncAnimation
2.1 Loading the Dataset & Defining Data Types¶
# Loading the dataset
path = '../Data/hotel_bookings.csv'
# Read in csv file
df = pd.read_csv(path)
df.head()
| hotel | is_canceled | lead_time | arrival_date_year | arrival_date_month | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | ... | deposit_type | agent | company | days_in_waiting_list | customer_type | adr | required_car_parking_spaces | total_of_special_requests | reservation_status | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 01-07-15 |
| 1 | Resort Hotel | 0 | 737 | 2015 | July | 27 | 1 | 0 | 0 | 2 | ... | No Deposit | NaN | NaN | 0 | Transient | 0.0 | 0 | 0 | Check-Out | 01-07-15 |
| 2 | Resort Hotel | 0 | 7 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | NaN | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 02-07-15 |
| 3 | Resort Hotel | 0 | 13 | 2015 | July | 27 | 1 | 0 | 1 | 1 | ... | No Deposit | 304.0 | NaN | 0 | Transient | 75.0 | 0 | 0 | Check-Out | 02-07-15 |
| 4 | Resort Hotel | 0 | 14 | 2015 | July | 27 | 1 | 0 | 2 | 2 | ... | No Deposit | 240.0 | NaN | 0 | Transient | 98.0 | 0 | 1 | Check-Out | 03-07-15 |
5 rows × 32 columns
# Note that the describe function defaults to using only some variables
df.describe()
| is_canceled | lead_time | arrival_date_year | arrival_date_week_number | arrival_date_day_of_month | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | is_repeated_guest | previous_cancellations | previous_bookings_not_canceled | booking_changes | agent | company | days_in_waiting_list | adr | required_car_parking_spaces | total_of_special_requests | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119386.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 103050.000000 | 6797.000000 | 119390.000000 | 119390.000000 | 119390.000000 | 119390.000000 |
| mean | 0.370416 | 104.011416 | 2016.156554 | 27.165173 | 15.798241 | 0.927599 | 2.500302 | 1.856403 | 0.103890 | 0.007949 | 0.031912 | 0.087118 | 0.137097 | 0.221124 | 86.693382 | 189.266735 | 2.321149 | 101.831122 | 0.062518 | 0.571363 |
| std | 0.482918 | 106.863097 | 0.707476 | 13.605138 | 8.780829 | 0.998613 | 1.908286 | 0.579261 | 0.398561 | 0.097436 | 0.175767 | 0.844336 | 1.497437 | 0.652306 | 110.774548 | 131.655015 | 17.594721 | 50.535790 | 0.245291 | 0.792798 |
| min | 0.000000 | 0.000000 | 2015.000000 | 1.000000 | 1.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.000000 | 6.000000 | 0.000000 | -6.380000 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 18.000000 | 2016.000000 | 16.000000 | 8.000000 | 0.000000 | 1.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 9.000000 | 62.000000 | 0.000000 | 69.290000 | 0.000000 | 0.000000 |
| 50% | 0.000000 | 69.000000 | 2016.000000 | 28.000000 | 16.000000 | 1.000000 | 2.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 14.000000 | 179.000000 | 0.000000 | 94.575000 | 0.000000 | 0.000000 |
| 75% | 1.000000 | 160.000000 | 2017.000000 | 38.000000 | 23.000000 | 2.000000 | 3.000000 | 2.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 229.000000 | 270.000000 | 0.000000 | 126.000000 | 0.000000 | 1.000000 |
| max | 1.000000 | 737.000000 | 2017.000000 | 53.000000 | 31.000000 | 19.000000 | 50.000000 | 55.000000 | 10.000000 | 10.000000 | 1.000000 | 26.000000 | 72.000000 | 21.000000 | 535.000000 | 543.000000 | 391.000000 | 5400.000000 | 8.000000 | 5.000000 |
# Returns the dimensions of the DataFrame as (number of rows, number of columns)
df.shape
(119390, 32)
# Returns an index object containing the column labels of the DataFrame.
df.columns
Index(['hotel', 'is_canceled', 'lead_time', 'arrival_date_year',
'arrival_date_month', 'arrival_date_week_number',
'arrival_date_day_of_month', 'stays_in_weekend_nights',
'stays_in_week_nights', 'adults', 'children', 'babies', 'meal',
'country', 'market_segment', 'distribution_channel',
'is_repeated_guest', 'previous_cancellations',
'previous_bookings_not_canceled', 'reserved_room_type',
'assigned_room_type', 'booking_changes', 'deposit_type', 'agent',
'company', 'days_in_waiting_list', 'customer_type', 'adr',
'required_car_parking_spaces', 'total_of_special_requests',
'reservation_status', 'reservation_status_date'],
dtype='object')
# Provides a concise summary of the DataFrame including data types, non-null values, and memory usage.
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 119390 entries, 0 to 119389 Data columns (total 32 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 hotel 119390 non-null object 1 is_canceled 119390 non-null int64 2 lead_time 119390 non-null int64 3 arrival_date_year 119390 non-null int64 4 arrival_date_month 119390 non-null object 5 arrival_date_week_number 119390 non-null int64 6 arrival_date_day_of_month 119390 non-null int64 7 stays_in_weekend_nights 119390 non-null int64 8 stays_in_week_nights 119390 non-null int64 9 adults 119390 non-null int64 10 children 119386 non-null float64 11 babies 119390 non-null int64 12 meal 119390 non-null object 13 country 118902 non-null object 14 market_segment 119390 non-null object 15 distribution_channel 119390 non-null object 16 is_repeated_guest 119390 non-null int64 17 previous_cancellations 119390 non-null int64 18 previous_bookings_not_canceled 119390 non-null int64 19 reserved_room_type 119390 non-null object 20 assigned_room_type 119390 non-null object 21 booking_changes 119390 non-null int64 22 deposit_type 119390 non-null object 23 agent 103050 non-null float64 24 company 6797 non-null float64 25 days_in_waiting_list 119390 non-null int64 26 customer_type 119390 non-null object 27 adr 119390 non-null float64 28 required_car_parking_spaces 119390 non-null int64 29 total_of_special_requests 119390 non-null int64 30 reservation_status 119390 non-null object 31 reservation_status_date 119390 non-null object dtypes: float64(4), int64(16), object(12) memory usage: 29.1+ MB
Attributes Kept¶
We narrowed down the dataset from 32 attributes to 11 key features that closely align with our objective and prediction task.
Aggregated Features:
- num_of_guests: combines adults, children, and babies, because the total number of guests is more likely to influence cancellations than each guest type separately.
- length_of_stay: merges stays_in_weekend_nights and stays_in_week_nights to reflect the total duration of the stay, which matters more than the specific days of the week.
Individual Attributes:
- hotel: captures whether the booking is for a resort or city hotel, which is important for understanding cancellation trends across hotel types (i.e., vacations vs. business trips).
- is_repeated_guest, previous_cancellations, booking_changes: provide insight into guest behavior, including past cancellations and modifications, which could predict future cancellations.
- adr: the average daily rate paid by guests, which helps reveal whether cancellations are more likely at certain price points.
- deposit_type: reflects whether a deposit was made, which is important in determining the likelihood of follow-through on the reservation.
- lead_time: the amount of time between booking and arrival, a key factor in understanding guest decision patterns.
- reservation_status_date: lets us investigate whether particular months of the year yield higher cancellation rates. This will be converted to month_year in order to extract monthly data per year.
- customer_type: lets us analyze whether cancellations are more common among particular types of customers.
These attributes were selected based on their relevance to predicting cancellations and their potential to provide meaningful insights into guest behaviors and booking patterns.
Attributes Collected in Data¶
Categorical:¶
- hotel: indicates the type of hotel. Encoded as bool: 0 = Resort Hotel, 1 = City Hotel.
- is_canceled: specifies whether the booking was canceled. Target, int: 0 = not canceled, 1 = canceled.
- is_repeated_guest: indicates whether a guest is a repeat visitor. Encoded as bool: 0 = not repeated, 1 = repeated.
- deposit_type: indicates the type of deposit made for the booking. One-hot encoded from 3 values: 'No Deposit', 'Refundable', 'Non Refund'.
- month_year: date at which the last reservation status was set. Stored as period[M] to denote monthly frequency for our analysis.
- customer_type: indicates the type of customer making the reservation (e.g., a single person, a group, or a family). Stored as object.
Numerical:¶
- lead_time: number of days between the booking date and the arrival date. int
- length_of_stay: total number of nights stayed. int
- num_of_guests: total number of guests under the reservation. int
- previous_cancellations: number of times the guest previously canceled a booking. int
- booking_changes: number of changes made to the booking. int
- adr: average daily rate (price per room) for the booking. float
Attributes Dropped¶
Irrelevant:¶
agent, company, market_segment, and distribution_channel are unlikely to be directly relevant to predicting cancellations.
Redundancy:¶
arrival_date_year, arrival_date_month, arrival_date_week_number, and arrival_date_day_of_month all describe the same arrival date in different forms, making it redundant to keep all of them.
Low Impact:¶
required_car_parking_spaces, days_in_waiting_list, and total_of_special_requests do not yield enough predictive power for our prediction task.
Data Leakage:¶
reservation_status leaks the outcome of a booking (cancellation): keeping it would feed the model the answer instead of letting it predict.
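A toy example of why this column must go. The rows below are invented, but the status-to-target mapping mirrors the dataset, where both 'Canceled' and 'No-Show' imply is_canceled = 1:

```python
import pandas as pd

# Invented rows whose categories mimic the dataset's reservation_status values.
toy = pd.DataFrame({
    "reservation_status": ["Check-Out", "Canceled", "Canceled", "Check-Out", "No-Show"],
    "is_canceled":        [0,           1,          1,          0,           1],
})

# Every status maps to exactly one target value, so a model given this
# column could "predict" cancellations perfectly without learning anything.
leak = pd.crosstab(toy["reservation_status"], toy["is_canceled"])
print(leak)
```

Each row of the crosstab has all its mass in a single is_canceled column, which is the signature of a leaked outcome variable.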
# Attributes Dropped
df.drop(['arrival_date_year',
         'arrival_date_month',
         'arrival_date_week_number',
         'arrival_date_day_of_month',
         'agent',
         'company',
         'meal',
         'country',
         'market_segment',
         'distribution_channel',
         'previous_bookings_not_canceled',
         'reserved_room_type',
         'assigned_room_type',
         'days_in_waiting_list',
         'required_car_parking_spaces',
         'reservation_status',
         'total_of_special_requests'
        ],
        axis=1,
        inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 119390 entries, 0 to 119389 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 hotel 119390 non-null object 1 is_canceled 119390 non-null int64 2 lead_time 119390 non-null int64 3 stays_in_weekend_nights 119390 non-null int64 4 stays_in_week_nights 119390 non-null int64 5 adults 119390 non-null int64 6 children 119386 non-null float64 7 babies 119390 non-null int64 8 is_repeated_guest 119390 non-null int64 9 previous_cancellations 119390 non-null int64 10 booking_changes 119390 non-null int64 11 deposit_type 119390 non-null object 12 customer_type 119390 non-null object 13 adr 119390 non-null float64 14 reservation_status_date 119390 non-null object dtypes: float64(2), int64(9), object(4) memory usage: 13.7+ MB
2.2 Verifying Data Quality¶
Handling Missing and Duplicated Data¶
Before proceeding with feature aggregation, it’s crucial to ensure data quality by addressing missing and duplicated values.
Missing Data:¶
- Identified Issue: The children attribute had 4 missing values out of 119,390 entries. These were likely caused by users leaving the "number of children" field blank during the booking process.
- Analysis and Visualization: To verify this, we visualized missing data using a bar plot, which showed that only the children column had missing entries, and the extent of missingness was minimal.
- Justification for Imputation: Given that there were only 4 missing values and that these likely represent bookings without children, we imputed the missing values with 0. Since the missing data represents less than 0.01% of the total entries, this method avoids skewing the dataset without requiring more advanced imputation techniques such as KNN, which would be unnecessary for so small a fraction.
Duplicated Data:¶
- Identified Issue: After dropping the unused columns, we detected 36,256 duplicate entries out of 119,390 total entries (about 30% of the data).
- Potential Causes: Duplicates may have resulted from merging different systems or from accidental/intentional repeat bookings by the same guests.
- Visualization: We confirmed the presence of duplicates by inspecting the first 6 entries of the dataset and spotting a duplicate in the 5th and 6th rows.
- Justification for Elimination: After examining the dataset, we decided to drop all duplicates to prevent over-representation of certain guests or bookings, which could introduce bias in our model. Removing them ensures that each entry represents a unique booking instance, and the 80,000+ unique entries that remain are still a robust dataset for analysis.
# Check for missing values
df.isnull().sum()
hotel 0 is_canceled 0 lead_time 0 stays_in_weekend_nights 0 stays_in_week_nights 0 adults 0 children 4 babies 0 is_repeated_guest 0 previous_cancellations 0 booking_changes 0 deposit_type 0 customer_type 0 adr 0 reservation_status_date 0 dtype: int64
# Counting the missing values per col/attribute
missing_counts = df.isnull().sum()
# Set figure size
plt.figure(figsize=(10, 6))
# Plot missing values
missing_counts[missing_counts > 0].plot(kind='bar')
# Set title labels, rotate x-axis
plt.title('Missing Values Count by Column')
plt.xlabel('Column')
plt.ylabel('Number of Missing Values')
plt.xticks(rotation=45)
# Display
plt.show()
# Checking for Duplicates
df.duplicated().sum()
np.int64(36256)
# Example: Last two entries here are duplicates!
df.head(6)
| hotel | is_canceled | lead_time | stays_in_weekend_nights | stays_in_week_nights | adults | children | babies | is_repeated_guest | previous_cancellations | booking_changes | deposit_type | customer_type | adr | reservation_status_date | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Resort Hotel | 0 | 342 | 0 | 0 | 2 | 0.0 | 0 | 0 | 0 | 3 | No Deposit | Transient | 0.0 | 01-07-15 |
| 1 | Resort Hotel | 0 | 737 | 0 | 0 | 2 | 0.0 | 0 | 0 | 0 | 4 | No Deposit | Transient | 0.0 | 01-07-15 |
| 2 | Resort Hotel | 0 | 7 | 0 | 1 | 1 | 0.0 | 0 | 0 | 0 | 0 | No Deposit | Transient | 75.0 | 02-07-15 |
| 3 | Resort Hotel | 0 | 13 | 0 | 1 | 1 | 0.0 | 0 | 0 | 0 | 0 | No Deposit | Transient | 75.0 | 02-07-15 |
| 4 | Resort Hotel | 0 | 14 | 0 | 2 | 2 | 0.0 | 0 | 0 | 0 | 0 | No Deposit | Transient | 98.0 | 03-07-15 |
| 5 | Resort Hotel | 0 | 14 | 0 | 2 | 2 | 0.0 | 0 | 0 | 0 | 0 | No Deposit | Transient | 98.0 | 03-07-15 |
Converting to Appropriate Data Types¶
- Now that we've discovered missing and duplicated data, we can proceed to impute and eliminate, respectively.
- But first, we'll make sure to convert to appropriate data types if needed!
- We also need to aggregate a few attributes.
# Convert hotel to boolean
df['hotel'] = df['hotel'].map({'Resort Hotel': 0, 'City Hotel': 1}).astype('bool')
# Convert is_repeated_guest to boolean
df['is_repeated_guest'] = df['is_repeated_guest'].astype('bool')
# One-hot encoding deposit_type, indicating presence (1) or absence (0).
if 'deposit_type' in df.columns:
    df = pd.get_dummies(df, columns=['deposit_type'], prefix=['deposit_type'])
# Imputing missing values: fill the children attribute with 0
df['children'] = df['children'].fillna(0).astype('int')
# Converting reservation_status_date to month_year
df['reservation_status_date'] = pd.to_datetime(df['reservation_status_date'], format='%d-%m-%y', errors='coerce')
# Create month_year column in monthly period format
df['month_year'] = df['reservation_status_date'].dt.to_period('M')
# Filter Dataframe
start_period = pd.Period('2015-07', freq='M')
df = df[df['month_year'] >= start_period]
# Aggregation of 'adults', 'children', and 'babies' into 'num_of_guests'
df['num_of_guests'] = df['adults'] + df['children'] + df['babies']
# Aggregation of 'stays_in_weekend_nights' and 'stays_in_week_nights' into 'length_of_stay'
df['length_of_stay'] = df['stays_in_weekend_nights'] + df['stays_in_week_nights']
# Dropping the attributes we aggregated
df = df.drop(columns=['adults', 'children', 'babies', 'stays_in_weekend_nights', 'stays_in_week_nights'])
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 117040 entries, 0 to 119389 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 hotel 117040 non-null bool 1 is_canceled 117040 non-null int64 2 lead_time 117040 non-null int64 3 is_repeated_guest 117040 non-null bool 4 previous_cancellations 117040 non-null int64 5 booking_changes 117040 non-null int64 6 customer_type 117040 non-null object 7 adr 117040 non-null float64 8 reservation_status_date 117040 non-null datetime64[ns] 9 deposit_type_No Deposit 117040 non-null bool 10 deposit_type_Non Refund 117040 non-null bool 11 deposit_type_Refundable 117040 non-null bool 12 month_year 117040 non-null period[M] 13 num_of_guests 117040 non-null int64 14 length_of_stay 117040 non-null int64 dtypes: bool(5), datetime64[ns](1), float64(1), int64(6), object(1), period[M](1) memory usage: 10.4+ MB
# No more null values!
df.isnull().sum()
hotel 0 is_canceled 0 lead_time 0 is_repeated_guest 0 previous_cancellations 0 booking_changes 0 customer_type 0 adr 0 reservation_status_date 0 deposit_type_No Deposit 0 deposit_type_Non Refund 0 deposit_type_Refundable 0 month_year 0 num_of_guests 0 length_of_stay 0 dtype: int64
# Dropping Duplicates
df.drop_duplicates(inplace=True)
# No more duplicates!
df.duplicated().sum()
np.int64(0)
df.info()
<class 'pandas.core.frame.DataFrame'> Index: 82363 entries, 0 to 119389 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 hotel 82363 non-null bool 1 is_canceled 82363 non-null int64 2 lead_time 82363 non-null int64 3 is_repeated_guest 82363 non-null bool 4 previous_cancellations 82363 non-null int64 5 booking_changes 82363 non-null int64 6 customer_type 82363 non-null object 7 adr 82363 non-null float64 8 reservation_status_date 82363 non-null datetime64[ns] 9 deposit_type_No Deposit 82363 non-null bool 10 deposit_type_Non Refund 82363 non-null bool 11 deposit_type_Refundable 82363 non-null bool 12 month_year 82363 non-null period[M] 13 num_of_guests 82363 non-null int64 14 length_of_stay 82363 non-null int64 dtypes: bool(5), datetime64[ns](1), float64(1), int64(6), object(1), period[M](1) memory usage: 7.3+ MB
# Reset the index just so it's easier to read the rows/entries
df = df.reset_index(drop=True)
df.tail()
| hotel | is_canceled | lead_time | is_repeated_guest | previous_cancellations | booking_changes | customer_type | adr | reservation_status_date | deposit_type_No Deposit | deposit_type_Non Refund | deposit_type_Refundable | month_year | num_of_guests | length_of_stay | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 82358 | True | 0 | 23 | False | 0 | 0 | Transient | 96.14 | 2017-09-06 | True | False | False | 2017-09 | 2 | 7 |
| 82359 | True | 0 | 102 | False | 0 | 0 | Transient | 225.43 | 2017-09-07 | True | False | False | 2017-09 | 3 | 7 |
| 82360 | True | 0 | 34 | False | 0 | 0 | Transient | 157.71 | 2017-09-07 | True | False | False | 2017-09 | 2 | 7 |
| 82361 | True | 0 | 109 | False | 0 | 0 | Transient | 104.40 | 2017-09-07 | True | False | False | 2017-09 | 2 | 7 |
| 82362 | True | 0 | 205 | False | 0 | 0 | Transient | 151.20 | 2017-09-07 | True | False | False | 2017-09 | 2 | 9 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 82363 entries, 0 to 82362 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 hotel 82363 non-null bool 1 is_canceled 82363 non-null int64 2 lead_time 82363 non-null int64 3 is_repeated_guest 82363 non-null bool 4 previous_cancellations 82363 non-null int64 5 booking_changes 82363 non-null int64 6 customer_type 82363 non-null object 7 adr 82363 non-null float64 8 reservation_status_date 82363 non-null datetime64[ns] 9 deposit_type_No Deposit 82363 non-null bool 10 deposit_type_Non Refund 82363 non-null bool 11 deposit_type_Refundable 82363 non-null bool 12 month_year 82363 non-null period[M] 13 num_of_guests 82363 non-null int64 14 length_of_stay 82363 non-null int64 dtypes: bool(5), datetime64[ns](1), float64(1), int64(6), object(1), period[M](1) memory usage: 6.7+ MB
One Last Thing: Checking For Outliers¶
df.describe()
| is_canceled | lead_time | previous_cancellations | booking_changes | adr | reservation_status_date | num_of_guests | length_of_stay | |
|---|---|---|---|---|---|---|---|---|
| count | 82363.000000 | 82363.000000 | 82363.000000 | 82363.000000 | 82363.000000 | 82363 | 82363.000000 | 82363.000000 |
| mean | 0.280672 | 78.356434 | 0.026468 | 0.273120 | 107.326994 | 2016-09-07 04:31:01.826305536 | 2.039265 | 3.661280 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | -6.380000 | 2015-07-01 00:00:00 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 11.000000 | 0.000000 | 0.000000 | 72.750000 | 2016-03-25 00:00:00 | 2.000000 | 2.000000 |
| 50% | 0.000000 | 48.000000 | 0.000000 | 0.000000 | 99.000000 | 2016-09-13 00:00:00 | 2.000000 | 3.000000 |
| 75% | 1.000000 | 123.000000 | 0.000000 | 0.000000 | 135.000000 | 2017-03-08 00:00:00 | 2.000000 | 5.000000 |
| max | 1.000000 | 737.000000 | 13.000000 | 21.000000 | 5400.000000 | 2017-09-14 00:00:00 | 20.000000 | 69.000000 |
| std | 0.449330 | 84.411887 | 0.298818 | 0.736445 | 55.641959 | NaN | 0.712955 | 2.785946 |
- There are extreme values in the following attributes: lead_time, with a max of 737 against a mean of roughly 78, and adr, with a max of 5400 against a mean of roughly 107. These two can have the most significant negative impact on the analysis of predicting cancellations, as they may bias the model toward these values.
- In addition, adr has a min value of -6.38, which looks like a data entry error unless the hotel paid someone $6.38 to reserve a room (highly unlikely). In this case, we'll keep only entries with adr of at least 0 and remove the rest.
- We decided to apply the IQR method, as lead_time and adr are skewed and this approach is more robust:
  - lead_time: most people book within a short window, but a few book very far in advance.
  - adr: high-end bookings or promotions can cause large deviations.
# Resolving the negative value in adr
df = df[df['adr'] >= 0]
# IQR Function
def remove_outliers_iqr(df, column):
    # Calculate the first and third quartiles
    Q1 = df[column].quantile(0.25)
    Q3 = df[column].quantile(0.75)
    # Compute the interquartile range (IQR)
    IQR = Q3 - Q1
    # Define lower and upper bounds for outlier detection
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR
    # Filter the DataFrame to remove outliers
    df = df[(df[column] >= lower_bound) & (df[column] <= upper_bound)]
    return df
# Define the columns from which to remove outliers
columns_to_filter = ['lead_time', 'adr']
# Iterate and apply the function
for col in columns_to_filter:
    df = remove_outliers_iqr(df, col)
df = df.reset_index(drop=True)
# Extreme outliers from `lead_time` and `adr` have been removed!
df.describe()
| is_canceled | lead_time | previous_cancellations | booking_changes | adr | reservation_status_date | num_of_guests | length_of_stay | |
|---|---|---|---|---|---|---|---|---|
| count | 78017.000000 | 78017.000000 | 78017.000000 | 78017.000000 | 78017.000000 | 78017 | 78017.000000 | 78017.000000 |
| mean | 0.272133 | 71.019803 | 0.026751 | 0.263763 | 103.314852 | 2016-09-01 20:40:41.744748032 | 2.010357 | 3.598036 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 2015-07-01 00:00:00 | 0.000000 | 0.000000 |
| 25% | 0.000000 | 10.000000 | 0.000000 | 0.000000 | 72.250000 | 2016-03-21 00:00:00 | 2.000000 | 2.000000 |
| 50% | 0.000000 | 45.000000 | 0.000000 | 0.000000 | 98.000000 | 2016-09-08 00:00:00 | 2.000000 | 3.000000 |
| 75% | 1.000000 | 115.000000 | 0.000000 | 0.000000 | 132.000000 | 2017-03-02 00:00:00 | 2.000000 | 5.000000 |
| max | 1.000000 | 291.000000 | 13.000000 | 21.000000 | 230.000000 | 2017-09-14 00:00:00 | 12.000000 | 69.000000 |
| std | 0.445061 | 72.372885 | 0.304994 | 0.719954 | 46.218942 | NaN | 0.680556 | 2.745294 |
df.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 78017 entries, 0 to 78016 Data columns (total 15 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 hotel 78017 non-null bool 1 is_canceled 78017 non-null int64 2 lead_time 78017 non-null int64 3 is_repeated_guest 78017 non-null bool 4 previous_cancellations 78017 non-null int64 5 booking_changes 78017 non-null int64 6 customer_type 78017 non-null object 7 adr 78017 non-null float64 8 reservation_status_date 78017 non-null datetime64[ns] 9 deposit_type_No Deposit 78017 non-null bool 10 deposit_type_Non Refund 78017 non-null bool 11 deposit_type_Refundable 78017 non-null bool 12 month_year 78017 non-null period[M] 13 num_of_guests 78017 non-null int64 14 length_of_stay 78017 non-null int64 dtypes: bool(5), datetime64[ns](1), float64(1), int64(6), object(1), period[M](1) memory usage: 6.3+ MB
Data Understanding Summary¶
- Data Types: All attributes are now represented in their appropriate data types.
- Feature Aggregation: Aggregated attributes as needed for efficiency.
- Missing Values: Imputed missing values to ensure completeness.
- Duplicates: Removed duplicate entries to avoid over-representation and bias.
- Outliers: Removed extreme outliers present in a few attributes
3. Data Visualization¶
3.1 Visualizing Basic Feature Distributions¶
# Cross Tabulation Showing Cancellation Counts Among Numerical Attributes
# List of numerical columns to analyze
numerical = ['lead_time', 'previous_cancellations', 'booking_changes', 'adr', 'num_of_guests', 'length_of_stay']
# Create copy of df to not modify original data
crosstab_df = copy.deepcopy(df)
crosstab_df['is_canceled'] = df['is_canceled'].map({0: 'Not Canceled', 1: 'Canceled'})
# Num of numerical columns
L = len(numerical)
# Number of columns in grid; calculate number of rows
ncol = 3
nrow = int(np.ceil(L / ncol))
# Setup matplotlib and axis
fig, axes = plt.subplots(nrow, ncol, figsize=(15, 5 * nrow))
fig.subplots_adjust(top=0.92)
axes = axes.flatten()
# Loop over numerical columns to create stacked histograms
for i, col in enumerate(numerical):
sns.histplot(data=crosstab_df, x=col, hue="is_canceled", multiple="stack", ax=axes[i], bins=30, palette="coolwarm")
# Set title, labels, enable grid
axes[i].set_title(f'Distribution of {col}', fontsize=14)
axes[i].set_xlabel(f'{col} Value', fontsize=12)
axes[i].set_ylabel('Frequency', fontsize=12)
axes[i].grid(True, which='both', linestyle='--', linewidth=0.5)
sns.despine(ax=axes[i])
# Delete any unused (empty) subplots
for j in range(L, nrow * ncol):
fig.delaxes(axes[j])
# Cleaner look and display
plt.tight_layout()
plt.show()
Analysis:¶
- lead_time: non-cancellations have a higher frequency than cancellations at low lead times; as lead time increases, both gradually decrease, but cancellations tend to remain at similar levels throughout higher lead times.
- previous_cancellations: guests with few or no previous cancellations are less likely to cancel their reservations.
- booking_changes: guests with few booking changes are less likely to cancel their reservations.
- adr: cancellations and non-cancellations exhibit similar rates as adr increases, indicating that beyond a certain amount both categories behave similarly.
- num_of_guests: smaller parties cancel more often than larger ones, likely because it is more flexible to cancel with fewer members in the group.
- length_of_stay: shorter stays show fewer cancellations overall, though cancellations still occur during short visits, since short stays carry less commitment than longer ones.
# Scatterplot of Lead Time vs ADR with Cancellation Status
# Create copy of df to not modify original data and map binary cancellation w/ readable labels
scatter_copy = copy.deepcopy(df)
scatter_copy['is_canceled'] = scatter_copy['is_canceled'].map({0: 'Not Canceled', 1: 'Canceled'})
# Set figure size
plt.figure(figsize=(12, 7))
# Create scatterplot lead time vs ADR
sns.scatterplot(data=scatter_copy, x='lead_time', y='adr', hue='is_canceled', s=50, alpha=0.7, palette='pastel')
# Set title, labels, add legend
plt.title('Scatterplot of Lead Time vs ADR with Cancellation Status')
plt.xlabel('Lead Time')
plt.ylabel('Average Daily Rate (ADR)')
plt.legend(title='Cancellation Status')
# Cleaner look and display
plt.tight_layout()
plt.show()
Analysis:¶
- The scatterplot reveals that reservations with both high average daily rates and long lead times are frequently cancelled.
- Cancellations cluster more as lead time increases while the average daily rate stays around the median range. This could indicate that longer lead times are more likely to result in cancellations, which stakeholders can take into consideration when planning their operations.
- Non-cancellations cluster toward lower lead times and median average daily rates. This makes sense: the less time guests have to wait between making a reservation and arriving at the hotel, the less likely they are to cancel.
# Box Plot Displaying the Correlation between Length of Stay and Cancellation Status
# Set figure size
plt.figure(figsize=(10, 6))
# Create box plot to display correlation length of stay and cancellation
ax = sns.boxplot(x='is_canceled', y='length_of_stay', data=df,
hue='is_canceled', palette='pastel')
# Create dictionary to map cancellation status
labels = {0: 'Not Canceled', 1: 'Canceled'}
# Set the xaxis, and assign labels
ax.set_xticks([0, 1])
ax.set_xticklabels([labels[0], labels[1]])
# Set titles and labels, remove the legend
plt.xlabel('Cancellation Status')
plt.ylabel('Length of Stay')
plt.title('Length of Stay by Cancellation Status')
ax.legend_.remove()
# Display
plt.show()
Analysis:¶
- Median Comparison: The medians for the Cancelled and Not Cancelled categories are quite similar. This means that, on average, guests who cancel their bookings had planned to stay for about the same amount of time as those who don't.
- Extreme Values in Not Cancelled: Guests who keep their reservations show more extreme values in length of stay, suggesting a wider range of stay durations, with some staying much longer than others.
- Consistency in Cancelled: The Cancelled group has fewer extreme values, indicating that guests who cancel tend to have more consistent stay lengths. Longer stays might still be more prone to cancellation, possibly due to changing plans or uncertainties.
- Even though the medians are similar, the spread of stay lengths shows there might be more to uncover. It would be helpful to dive deeper into the data to see whether length of stay really impacts cancellation rates.
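That deeper dive can start with a per-group summary of the stay lengths; the sketch below uses a small illustrative frame in place of the notebook's df:

```python
import pandas as pd

# Toy bookings frame standing in for the notebook's df (values are illustrative)
toy = pd.DataFrame({
    'is_canceled':    [0, 0, 0, 0, 1, 1, 1],
    'length_of_stay': [2, 3, 10, 21, 2, 3, 4],
})

# Per-group summary: count, median, and the extremes the box plot hints at
summary = toy.groupby('is_canceled')['length_of_stay'].describe()
print(summary[['count', '50%', 'max']])
```

On the real data, comparing the '50%' and 'max' columns across the two groups quantifies the "similar medians, different spread" pattern seen in the box plot.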
# Distribution of Average Daily Rate
# Set style of seaborn plot and context
sns.set_style("darkgrid")
sns.set_context("talk")
# Set figure size
plt.figure(figsize=(10, 6))
# Create Histogram with KDE
sns.histplot(data=df, x='adr', kde=True, bins=20, color='skyblue', line_kws={'color': 'darkblue', 'lw': 3})
# Set title, labels, grid lines
plt.title('Distribution of Average Daily Rate', fontsize=16, weight='bold')
plt.xlabel('Average Daily Rate', fontsize=14)
plt.ylabel('Frequency', fontsize=14)
plt.grid(True, linestyle='--', alpha=0.7)
# Display
plt.show()
# Statistics
df['adr'].describe()
count 78017.000000 mean 103.314852 std 46.218942 min 0.000000 25% 72.250000 50% 98.000000 75% 132.000000 max 230.000000 Name: adr, dtype: float64
Distribution Insights:¶
Average Daily Rate Overview:
- The average daily rate (ADR) of hotel stays ranges from a minimum of 0 to a maximum of 230.
- The mean ADR is approximately $103.31.
- The median ADR is close to the mean, at $98.00, suggesting a relatively symmetric distribution of ADR values around the center.
Variability in ADR:
- The standard deviation of 46.22 reveals a significant spread around the average ADR.
- This suggests that there is considerable variability in the prices of hotel rooms.
- Some guests are paying much more or less than the average rate, reflecting a range of pricing strategies or room types.
Percentile Distribution:
- The 25th percentile at 72.25 indicates that 25% of the ADR values are below 72.25.
- The 75th percentile at 132.00 shows that 75% of the ADR values are below 132.00.
Histogram and Kernel Density Estimate (KDE):
- The peak of the histogram and KDE curve suggests that most ADR values are concentrated around $100, with a gradual decline in frequency for higher and lower ADR values.
Observations from the Plot:
- The distribution shows a skew towards higher ADR values, with a notable frequency of ADRs around the $100 mark.
- There is a long tail on the higher end, indicating that while most ADRs are clustered around the middle range, there are some high-value outliers.
- We decided to keep those high-value outliers, as they likely reflect reservations for rooms with more features and amenities.
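The long right tail can be quantified with the sample skewness; the sketch below uses a synthetic right-skewed sample (not the actual ADR data) to illustrate the check:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed ADR-like sample (illustrative, not the real data)
adr = pd.Series(rng.gamma(shape=5.0, scale=20.0, size=10_000))

print(adr.skew())          # positive value => right (high-value) tail
print(adr.quantile(0.99))  # threshold beyond which the top 1% of rates sit
```

A positive skew confirms the visual impression of a heavier upper tail; a 99th-percentile cutoff is one way to flag, without removing, the high-value bookings.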
# Correlation Heatmap of Attributes
# Create copy of df to not modify original data
df_temp = copy.deepcopy(df)
# Converting boolean columns to integers in the temporary DataFrame
bool_cols = ['hotel', 'is_repeated_guest', 'deposit_type_No Deposit', 'deposit_type_Non Refund', 'deposit_type_Refundable']
df_temp[bool_cols] = df_temp[bool_cols].astype(int)
# Select column for correlation analysis, and compute
correlation_cols = df_temp[['is_canceled', 'lead_time', 'previous_cancellations', 'booking_changes', 'adr', 'num_of_guests', 'length_of_stay'] + bool_cols]
correlation_matrix = correlation_cols.corr()
# Set figure size
plt.figure(figsize=(12, 10))
# Create heatmap
sns.heatmap(correlation_matrix, annot=True, cmap="viridis", linewidths=0.5, fmt=".2f")
# Set title and display
plt.title('Correlation Heatmap of Attributes')
plt.show()
Analysis:¶
General Trends:
- Average Daily Rate (adr): The positive correlations with lead_time (0.067) and num_of_guests (0.417) suggest that as the average daily rate increases, bookings tend to have longer lead times and more guests. Higher rates might be associated with more advance planning and potentially larger groups, both of which could impact cancellation behavior. Higher rates could also mean higher stakes, possibly leading to a greater chance of cancellation if plans change.
Low Correlations:
- Booking Cancellations (is_canceled): The weak correlations with most features suggest that cancellations are influenced by multiple factors, but not strongly by any single one. The relatively stronger correlations with lead_time (0.174) and adr (0.125) indicate that cancellations might be somewhat more likely with longer lead times and higher rates. This aligns with intuition: guests may have second thoughts or face changing circumstances over a longer period, and higher rates might lead to reconsideration or financial constraints.
Significant Negative Correlations:
- Deposit Types:
- deposit_type_No Deposit and deposit_type_Non Refund: The strong negative correlation (-0.955) indicates that bookings with no deposit are rarely non-refundable. This can be useful for predicting cancellations: a booking with no deposit is more flexible and therefore more likely to be canceled.
- deposit_type_No Deposit and deposit_type_Refundable: The negative correlation (-0.292) shows that bookings with no deposit are less likely to be refundable. This might affect cancellation behavior, because non-refundable deposits deter cancellations through the financial penalty, while no-deposit options might lead to higher cancellation rates.
Moderate Correlations:
- Lead Time and Length of Stay: The moderate positive correlation (0.333) between lead_time and length_of_stay suggests that bookings with longer lead times tend to have longer stays. This relationship can help predict cancellations, because longer bookings with extended lead times might be more vulnerable to changes in plans or unexpected events. Thus, longer lead times could potentially be associated with higher cancellation rates.
- Number of Guests and Length of Stay: The small positive correlation (0.095) between num_of_guests and length_of_stay indicates that the number of guests has only a minor influence on stay duration. This relationship is less significant, but it could still be relevant for understanding cancellation trends if larger groups are more likely to cancel or adjust plans.
Low Impact Features:
- Previous Cancellations and Booking Changes: Both features having low correlations with other attributes suggest they have less impact on cancellation predictions. However, analyzing these features individually might still provide insights into patterns of cancellations, especially if past behavior or frequent changes in bookings are indicators of a higher likelihood of cancellation.
Hotel Type and Repeated Guests:
- Correlation Insights: The weak correlations of hotel and is_repeated_guest with other features suggest that the type of hotel and whether a guest is a repeat customer don't strongly affect other attributes. However, understanding these factors could still be useful in a broader context. For instance, repeat guests might have different cancellation behaviors compared to first-time guests, and certain types of hotels might have policies or pricing structures that influence cancellation rates.
To leverage these correlations for predicting cancellations, we can focus on adr and lead_time, since these features show some correlation with cancellations and can serve as key predictors.
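One compact way to surface those key predictors is to rank features by their absolute correlation with the target; the sketch below uses a small illustrative frame rather than the full dataset:

```python
import pandas as pd

# Small stand-in frame (illustrative values, same column roles as the notebook)
frame = pd.DataFrame({
    'is_canceled': [0, 0, 1, 1, 0, 1],
    'lead_time':   [5, 10, 200, 150, 20, 180],
    'adr':         [80, 90, 160, 150, 100, 140],
})

# Rank features by absolute correlation with the target, strongest first
ranked = (frame.corr()['is_canceled']
               .drop('is_canceled')
               .abs()
               .sort_values(ascending=False))
print(ranked)
```

On the notebook's correlation_matrix, the same `['is_canceled']` column selection, drop, and sort would list the candidate predictors in order of strength.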
# Pie Chart Displaying Cancellation Rates for Each Deposit Type
# Calculating cancellations for each deposit type directly
cancellation_no_deposit = df[(df['deposit_type_No Deposit'] == True) & (df['is_canceled'] == 1)].shape[0]
cancellation_non_refund = df[(df['deposit_type_Non Refund'] == True) & (df['is_canceled'] == 1)].shape[0]
cancellation_refundable = df[(df['deposit_type_Refundable'] == True) & (df['is_canceled'] == 1)].shape[0]
# Collecting the cancellation counts into a labeled Series
cancellation_counts = pd.Series({
'No Deposit': cancellation_no_deposit,
'Non Refund': cancellation_non_refund,
'Refundable': cancellation_refundable
})
# Calculating percentages
total_cancellations = cancellation_counts.sum()
percentages = (cancellation_counts / total_cancellations) * 100
# Creating labels for the legend with percentages
legend_labels = [f'{label} - {percent:.1f}%' for label, percent in zip(cancellation_counts.index, percentages)]
colors = ['#ff9999', '#66b3ff', '#99ff99']
# Creating the pie chart without labels and percentages on the pie itself
plt.figure(figsize=(10, 7))
patches, _ = plt.pie(cancellation_counts, startangle=90, colors=colors, wedgeprops={'edgecolor': 'black'}, labels=None)
# Adjusting legend to include percentages
plt.legend(patches, legend_labels, loc='center left', bbox_to_anchor=(1, 0.5), fontsize=8, title='Deposit Type')
plt.title('Cancellations by Deposit Type')
plt.show()
# Displaying the number of cancellations for each deposit type
print('Number of Cancellations by Deposit Type:')
cancellation_counts
Number of Cancellations by Deposit Type:
No Deposit    20554
Non Refund      654
Refundable       23
dtype: int64
Analysis¶
No Deposit:
- Higher Cancellation Rates: Most common deposit type; likely associated with higher cancellations due to lower guest commitment.
- Potential Indicator: Frequent cancellations suggest that requiring a deposit could improve guest commitment and reduce cancellations.
Non-Refundable:
- Lower Cancellation Rates: Higher commitment due to financial penalty, resulting in fewer cancellations.
- Predictive Value: Lower cancellations indicate that guests are more committed when a financial penalty is involved.
Refundable:
- Lowest Cancellation Counts: Offers flexibility, but can still result in cancellations when guests are uncertain.
- Predictive Insights: The small number of refundable bookings and their low cancellation counts suggest that while flexibility is valued, it doesn't guarantee commitment.
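Raw counts alone can mislead here, because "No Deposit" dominates the bookings; dividing each group's cancellations by its size gives a cancellation rate instead. A sketch with an illustrative toy frame (not the real data):

```python
import pandas as pd

# Toy frame with one-hot deposit columns like the notebook's (illustrative rows)
toy = pd.DataFrame({
    'deposit_type_No Deposit': [True, True, True, False, False, False],
    'deposit_type_Non Refund': [False, False, False, True, True, False],
    'deposit_type_Refundable': [False, False, False, False, False, True],
    'is_canceled':             [1, 0, 1, 1, 1, 0],
})

# Rate per deposit type: cancellations within the group / group size
rates = {
    name.replace('deposit_type_', ''): toy.loc[toy[name], 'is_canceled'].mean()
    for name in ['deposit_type_No Deposit', 'deposit_type_Non Refund',
                 'deposit_type_Refundable']
}
print(rates)
```

Applied to the notebook's df, this would reveal whether "Non Refund" bookings truly cancel less often per booking, or merely contribute fewer cancellations because there are fewer of them.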
Question 1: What is the impact of repeat guest status on cancellation rates across different hotel types?¶
# Cancellation Rates by Repeat Status in Each Hotel Type
# Group data by repeated guest status/hotel type and calculate mean
cancellation_rate = df.groupby(['is_repeated_guest', 'hotel'])['is_canceled'].mean().reset_index()
# Mapping boolean values to appropriate labels
cancellation_rate['is_repeated_guest'] = cancellation_rate['is_repeated_guest'].map({True: 'Repeated', False: 'Not Repeated'})
cancellation_rate['hotel'] = cancellation_rate['hotel'].map({True: 'City Hotel', False: 'Resort Hotel'})
# Setting plot style and context with custom colors
sns.set_style("darkgrid")
sns.set_context("talk")
colors = ['#66c2a5', '#fc8d62']
# Create a bar plot to visualize cancellation rate
plt.figure(figsize=(10, 6))
sns.barplot(x='is_repeated_guest', y='is_canceled', hue='hotel', data=cancellation_rate, palette=colors)
# Set title and labels
plt.title('Cancellation Rate by Repeated Guest Status and Hotel Type', fontsize=16, weight='bold')
plt.xlabel('Repeated Guest Status', fontsize=14)
plt.ylabel('Cancellation Rate', fontsize=14)
# Customize legend, add grid for readability
plt.legend(title='Hotel Type', fontsize=12, title_fontsize='13')
plt.grid(True, linestyle='--', alpha=0.7)
plt.show()
# Output cancellation rate
cancellation_rate
| | is_repeated_guest | hotel | is_canceled |
|---|---|---|---|
| 0 | Not Repeated | Resort Hotel | 0.229029 |
| 1 | Not Repeated | City Hotel | 0.311461 |
| 2 | Repeated | Resort Hotel | 0.037760 |
| 3 | Repeated | City Hotel | 0.105911 |
Non-Repeated Guests Cancellation Rate¶
City Hotel: Non-repeated guests have a cancellation rate of ~31%.
Resort Hotel: Non-repeated guests have a cancellation rate of ~23%.
Non-repeated guests tend to cancel more often, especially in city hotels. This could indicate that first-time visitors to city hotels are more likely to have unpredictable plans or different expectations, leading to cancellations. Resort hotels seem to fare slightly better with non-repeated guests in terms of cancellations.
Repeated Guests Cancellation Rate¶
City Hotel: Repeated guests have a cancellation rate of ~11%.
Resort Hotel: Repeated guests have a much lower cancellation rate of ~4%.
Guests who have stayed before tend to cancel less frequently, especially at resort hotels, where the cancellation rate is very low. This suggests that resort hotels are successful at retaining loyal guests who follow through with their reservations, possibly through rewards programs or exceptional service.
City Hotels Experience Higher Cancellation Rates¶
- Across both guest types (repeated and non-repeated), city hotels face higher cancellation rates.
- This trend might reflect the different nature of trips for city hotels, where plans are possibly more subject to change, compared to resort stays that are likely more planned and leisure-oriented.
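Whether the city-versus-resort gap is statistically meaningful could be checked with a two-proportion z-test; the sketch below hand-rolls the test using only the standard library, and the counts are illustrative round numbers, not taken from the dataset:

```python
from math import sqrt

def two_proportion_z(x1, n1, x2, n2):
    """z-statistic for H0: p1 == p2, using the pooled-proportion standard error."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# Illustrative counts: city hotels cancelling more often than resort hotels
z = two_proportion_z(x1=3100, n1=10_000, x2=2290, n2=10_000)
print(round(z, 2))  # |z| > 1.96 => significant at the 5% level
```

With sample sizes in the tens of thousands, even modest rate gaps tend to produce large z-values, so the practical size of the difference matters as much as significance.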
Question 2: How do cancellation rates change over time, and are there noticeable trends or patterns in cancellations for each type of hotel?¶
# Relationship of month_year and cancellations across the two types of hotels
monthly_cancellations = df.groupby(['month_year', 'hotel'])['is_canceled'].mean().reset_index()
# Converting 'month_year' to string format for plotting
monthly_cancellations['month_year'] = monthly_cancellations['month_year'].astype(str)
monthly_cancellations['hotel'] = monthly_cancellations['hotel'].map({True: 'City Hotel', False: 'Resort Hotel'})
# Create line plot
plt.figure(figsize=(12, 6))
sns.lineplot(data=monthly_cancellations, x='month_year', y='is_canceled', hue='hotel', marker='o')
# Rotate x-axis, add labels, and title
plt.xticks(rotation=45)
plt.xlabel('Month-Year')
plt.ylabel('Average Cancellation Rate')
plt.title('Monthly Average Cancellation Rate by Hotel Type')
# Customize legend
plt.legend(title='Hotel Type')
# Clean and display
plt.tight_layout()
plt.show()
Analysis of Monthly Cancellation Rates¶
Trends Over Time:¶
The cancellation rates fluctuate over time for both hotel types. Rates tend to be higher during the first few months of the year, gradually decrease, and pick back up toward the end of the year. One explanation is that these periods coincide with the holidays and with the season when severe storms disrupt travel, leaving many guests to cancel their travel plans and reservations. More cancellations are noted for city hotels during these periods, which could indicate that guests are more willing to cancel reservations at city hotels than at resort hotels.
An interesting trend to note is that cancellations begin to decrease for both hotels at the beginning of 2017. Several factors could be at play: hotels might have updated their cancellation policies or begun requiring deposits to reduce cancellations. Another possibility is that 2017 had an unusual pattern in demand or seasonality (especially since it had the lowest rates during the summer months).
The implication of this trend is that hotels should prepare for cancellations during the holiday season, when external or internal factors prevent guests from following through with their reservations. The summer months, when children are out of school, tend to have fewer cancellations; this could imply that vacationing family groups are less likely to cancel their reservations, which is apparent in the graph. Therefore, we can expect cancellations to occur more often during the last and first few months of the year.
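Month-to-month noise can obscure the seasonal shape described above; a centered rolling mean is one way to smooth it. The sketch below uses an illustrative monthly series in place of the notebook's monthly_cancellations:

```python
import pandas as pd

# Illustrative monthly cancellation-rate series (stand-in for the real data)
rates = pd.Series(
    [0.35, 0.30, 0.28, 0.22, 0.20, 0.21, 0.19, 0.22, 0.26, 0.30, 0.33, 0.36],
    index=pd.period_range('2016-01', periods=12, freq='M'),
)

# A 3-month centered rolling mean smooths noise and exposes the seasonal shape
smoothed = rates.rolling(window=3, center=True).mean()
print(smoothed.round(3))
```

The first and last entries are NaN because a centered 3-month window has no complete neighborhood there; a wider window smooths more at the cost of more missing edge values.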
Question 3: Which customer type is more likely to cancel?¶
def cancellation_by_customer_type(df):
"""
Function to calculate the cancellation rate by customer type.
Args:
df (pandas.DataFrame): DataFrame containing the columns 'customer_type' and 'is_canceled'.
Returns:
pandas.DataFrame: A DataFrame with customer types and their corresponding cancellation rates.
"""
# Group by customer_type and calculate the cancellation rate
cancellation_rate = df.groupby('customer_type')['is_canceled'].mean().reset_index()
# Rename the columns for better readability
cancellation_rate.columns = ['customer_type', 'cancellation_rate']
# Convert cancellation rate to percentage for easier interpretation
cancellation_rate['cancellation_rate'] = cancellation_rate['cancellation_rate'] * 100
return cancellation_rate
result = cancellation_by_customer_type(df)
print(result)
     customer_type  cancellation_rate
0         Contract          15.224417
1            Group           6.490872
2        Transient          29.133241
3  Transient-Party          17.274130
def plot_violin_cancellation_by_customer_type(df):
"""
Function to plot a violin graph showing the distribution of cancellations by customer type.
Args:
df (pandas.DataFrame): DataFrame containing the columns 'customer_type' and 'is_canceled'.
"""
# Set up the plot size and style
plt.figure(figsize=(10, 6))
sns.set(style="whitegrid")
# Create the violin plot
sns.violinplot(x='customer_type', y='is_canceled', data=df, inner="point", hue='customer_type', palette="muted", legend=False)
# Add title and labels
plt.title('Distribution of Cancellations by Customer Type', fontsize=16)
plt.xlabel('Customer Type', fontsize=12)
plt.ylabel('Cancellation (0 = No, 1 = Yes)', fontsize=12)
# Show the plot
plt.show()
plot_violin_cancellation_by_customer_type(df)
Analysis¶
Cancellation Distributions by Customer Type¶
Broader violin shapes indicate variability in the likelihood of cancellation, unpredictability, and diverse reasons for cancelling. Transient customers have the highest cancellation rate (~29%). Group customers have the lowest rate (~6%) compared with the other customer types; coordinating many people in a group appears to make cancellation significantly less likely.
The implication we can infer from this data is that hotels should offer families holiday packages that encourage group travel and a collective, firm agenda, pressing these customers to commit to the vacation and their booking as a whole.
Predictive value: tailoring vacation trips for families with good incentives can support business profitability and expansion.
4. Additional Analysis¶
UMAP Dimensionality Reduction Methods¶
- UMAP is an algorithm for dimension reduction based on manifold learning techniques and ideas from topological data analysis (https://umap-learn.readthedocs.io/en/latest/how_umap_works.html).
- It provides a general framework for approaching manifold learning and dimension reduction, as well as providing specific concrete realizations.
- Essentially, it aims to uncover the underlying structure of data by assuming that high-dimensional data lies on a lower-dimensional manifold.
- UMAP leverages ideas from topological data to preserve the shape of the data and ensure that similar points remain close together.
- The key steps in UMAP are to construct a high-dimensional graph of the data, construct an analogous low-dimensional graph, and optimize the low-dimensional layout to match.
# UMAP
# Integer columns for UMAP
int_columns = ['is_canceled', 'lead_time', 'previous_cancellations', 'booking_changes', 'num_of_guests', 'length_of_stay']
data = df[int_columns]
# Standardizing the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
# Initializing UMAP model
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2)
# Applying UMAP
embedding = reducer.fit_transform(data_scaled)
# Set figure size
plt.figure(figsize=(10, 7))
# Create scatterplot of UMAP embedding, add colorbar
plt.scatter(embedding[:, 0], embedding[:, 1], c=df['is_canceled'], cmap='Spectral', s=5, alpha=0.5)
plt.colorbar(label='Is Canceled')
# Set titles, labels
plt.title('UMAP Projection of Hotel Reservation Data')
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
# Display
plt.show()
UMAP allows us to identify groups of bookings with similar characteristics and see clear cancellation patterns, thanks to the visual dimensionality reduction that collapses many features down to two dimensions. Clusters of bookings exhibiting high cancellation rates (e.g., bookings with long lead times and several previous cancellations aggregating together) can help tailor policies, such as tiered cancellation fees based on lead time or more flexible modification options, to decrease cancellations.
# UMAP
int_columns = ['is_canceled', 'lead_time', 'previous_cancellations', 'booking_changes', 'num_of_guests', 'length_of_stay']
data = df[int_columns]
# Standardizing the data
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
# Initialize UMAP model
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, n_components=2)
# Precomputing embeddings for each frame by slightly varying the min_dist parameter
embeddings = []
for frame in range(20): # Reducing the number of frames to 20
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1 + 0.01 * frame, n_components=2)
embedding = reducer.fit_transform(data_scaled)
embeddings.append(embedding)
# Setting up the figure for animation
fig, ax = plt.subplots(figsize=(10, 7))
# Initial scatter plot
sc = ax.scatter(embeddings[0][:, 0], embeddings[0][:, 1], c=df['is_canceled'], cmap='Spectral', s=5, alpha=0.5)
plt.colorbar(sc, label='Is Canceled')
plt.title('UMAP Projection of Hotel Reservation Data')
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
# Function to update the plot for each frame
def update(frame):
ax.clear()
# Update scatter plot with precomputed embeddings
embedding = embeddings[frame]
sc = ax.scatter(embedding[:, 0], embedding[:, 1], c=df['is_canceled'], cmap='Spectral', s=5, alpha=0.5)
plt.title(f'UMAP Projection of Hotel Reservation Data (Frame {frame})')
plt.xlabel('UMAP Dimension 1')
plt.ylabel('UMAP Dimension 2')
return sc,
# Creating the animation
ani = FuncAnimation(fig, update, frames=20, interval=300)
# Displaying the animation in the notebook
from IPython.display import HTML
HTML(ani.to_jshtml())
# Saving as GIF
# ani.save('umap_animation.gif', writer='imagemagick')
c:\Users\Juan Dominguez\AppData\Local\Programs\Python\Python312\Lib\site-packages\sklearn\manifold\_spectral_embedding.py:329: UserWarning: Graph is not fully connected, spectral embedding may not work as expected.
  warnings.warn(
c:\Users\Juan Dominguez\AppData\Local\Programs\Python\Python312\Lib\site-packages\umap\spectral.py:550: UserWarning: Spectral initialisation failed! The eigenvector solver failed. This is likely due to too small an eigengap. Consider adding some noise or jitter to your data. Falling back to random initialisation!
  warn(
The animation shows how the UMAP embedding evolves as the min_dist parameter varies: relationships between data points go from tight and local (small neighborhoods of similar bookings) to more spread out as min_dist increases, revealing a more global structure of broader relationships between groups of bookings.